DataRobot API in R
Disclaimer: The appearance of U.S. Department of Defense (DoD) visual information does not imply or constitute DoD endorsement. The views expressed in this presentation are those only of the author and do not represent the official position of the U.S. Army, DoD, or the federal government.
Personal Introduction
Who Am I?
- Education
- United States Military Academy '07, Bachelor of Science in Operations Research
- Missouri University of Science and Technology, Master of Science in Engineering Management
- THE Ohio State University, Master of Science in Industrial and Systems Engineering
- THE Ohio State University, Graduate Minor in Applied Statistics
- Work
- Schofield Barracks, Hawaii, Engineer Platoon Leader / Executive Officer / Assistant S3 (Operation IRAQI FREEDOM)
- White Sands Missile Range, New Mexico, Engineer A S3/Commander (Operation ENDURING FREEDOM)
- West Point, New York, Assistant Professor
- Fort Belvoir, Virginia, Operations Research Systems Analyst / Data Scientist
General Introduction
What are we doing here?
- Introduce DataRobot in GUI (Graphical User Interface)
- Show DataRobot through the API (Application Programming Interface) with R
- Incorporate some baseball
Who will get the most out of this presentation?
This is for someone who is…
- familiar with R,
- familiar with DataRobot, or
- familiar with general machine learning workflow.
What should you expect to gain from this?
From this presentation, you should gain…
- an understanding of how the DataRobot API in R process works,
- why you would want to do this, and
- where to go to get more information.
DataRobot Introduction
What is AutoML?
AutoML (Automated Machine Learning) is a process that automates the steps of building machine learning models.
These include
- Splitting data into test/train partitions
- Handling missing data
- Centering / Scaling Data
- Determining the best loss function to minimize
- Tuning parameters through cross validation
- Selecting different models
- Making predictions on new data
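For orientation, the steps above can be sketched by hand in base R. This is a minimal illustration on the built-in mtcars data (predicting the binary am column) — an invented stand-in example, not the baseball data used later:

```r
set.seed(42)

## 1. Split data into train/test partitions
idx   <- sample(seq_len(nrow(mtcars)), size = floor(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

## 2./3. Handle missing data and center/scale the predictors,
##       using the training data's means and standard deviations
num_cols <- c("mpg", "hp", "wt")
mu  <- sapply(train[num_cols], mean, na.rm = TRUE)
sdv <- sapply(train[num_cols], sd,   na.rm = TRUE)
train[num_cols] <- scale(train[num_cols], center = mu, scale = sdv)
test[num_cols]  <- scale(test[num_cols],  center = mu, scale = sdv)

## 4./5. Fit a model against a loss function
##       (logistic regression minimizes log loss)
fit <- glm(am ~ mpg + hp + wt, data = train, family = binomial())

## 6. Make predictions on new data
probs <- predict(fit, newdata = test, type = "response")
head(round(probs, 2))
```

AutoML tools run many variations of these steps (model tuning, model selection) automatically; the point here is only to show how much manual bookkeeping each step involves.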
Benefits
- AutoML can find high performing models quickly
- Creating models requires little experience
Drawbacks
- Users may not fully understand the recommended model
What is DataRobot?
According to the DataRobot webpage, “DataRobot is the leading end-to-end enterprise AI platform that automates and accelerates every step of your path from data to value.”
PAE (Program Analysis and Evaluation) hosts a version of DataRobot on the cPROBE cloud at https://deep-green.train.cprobe.army.mil/
Practical Example
Data Introduction
Hall of Fame Baseball Data originated from The Lahman Data Set
Data contains:
- Career statistics from players who played at least 10 seasons
- Final seasons tracked range from 1880 to 2016
- Statistics are current only through 2016 for players whose careers extended beyond 2016
- Hall-Of-Fame Selection status (0 for no, 1 for yes)
historical - players who have either made the Hall-Of-Fame or are no longer eligible (careers ended at or before 2003)
eligible - players who have not made the Hall-Of-Fame and are still eligible
historical <- read_csv("01_data/historical_baseball.csv") ## read in data
eligible <- read_csv("01_data/eligible_baseball.csv") ## read in data
Split into Train/Test
split <- rsample::initial_split(historical, strata = "inducted", prop = .5) ## creates a data split object
training <- rsample::training(split) ## extracts the training data from the data split object
testing <- rsample::testing(split) ## extracts the testing data from the data split object
Hall-Of-Fame / Non-Hall-Of-Fame Breakdown
# A tibble: 2 × 3
inducted train test
<dbl> <int> <int>
1 0 1496 1501
2 1 84 80
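A breakdown table like the one above can be produced with dplyr and tidyr. This is a sketch on simulated stand-in tibbles carrying the same counts; the real code would use the actual training and testing tibbles from rsample:

```r
library(dplyr)
library(tidyr)

## stand-in tibbles with the same counts as the real training/testing splits
training <- tibble(inducted = rep(c(0, 1), times = c(1496, 84)))
testing  <- tibble(inducted = rep(c(0, 1), times = c(1501, 80)))

breakdown <-
  bind_rows(train = training, test = testing, .id = "partition") %>% ## stack with a partition label
  count(inducted, partition) %>%                                     ## tally rows per class/partition
  pivot_wider(names_from = partition, values_from = n) %>%           ## one column per partition
  select(inducted, train, test)                                      ## match the column order above

breakdown
```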
DataRobot on cPROBE
DataRobot on cPROBE provides the traditional interactive tool to build many machine learning models according to preset conditions.
After you upload data, you set up a few modeling parameters.
After modeling is complete, it recommends a ‘best model’.
Select a model, upload data, execute predictions, then download the predictions.
DataRobot Through R API
Connect To Data Robot
library(datarobot)
ConnectToDataRobot(endpoint = "https://deep-green.train.cprobe.army.mil/api/v2",
token = Sys.getenv("key")
) ## connects to data robot
Get an API Key Step 1
Get an API Key Step 2
Optional: Place Key in .Renviron File
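The .Renviron file lives in your home directory (or project root) and is read when R starts; storing the token there keeps it out of your scripts. A sketch, assuming the variable name `key` used by `Sys.getenv("key")` above and a placeholder token value:

```
# ~/.Renviron
key=YOUR_DATAROBOT_API_TOKEN
```

Restart R (or call `readRenviron("~/.Renviron")`) for the change to take effect.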
Start Project
proj_name <- str_c("baseball_hof_", lubridate::today(), "_exhibition") ## create project name
StartProject(
dataSource = training, ## specifies data - can be dataframe, csv, zip file
projectName = proj_name, ## name of project
wait = TRUE, ## keeps console engaged during datarobot execution
maxWait = 1000, ## how long to wait for completion in seconds
verbosity = 1, ## how much feedback in the console: 1 - lots. 0 - none.
checkInterval = 5, ## how often in seconds it provides an update of progress
target = "inducted", # target column from data
metric = "LogLoss", # loss function to optimize
# rmse
mode = "quick", # datarobot's mode of model search
# "auto", "manual" or "quick"
targetType = "Binary", # target variable type
# "Multiclass" or "Regression"
positiveClass = 1, # if binary, what is "yes" in the data
workerCount = "max" # how many workers execute models in parallel
)
Look Up Project
All Projects
project_list <- ListProjects() %>% as.data.frame() ## returns list of projects
project_list %>% DT::datatable() ## shows list of projects
Project of Interest
project_list %>% filter(projectName == proj_name) %>% DT::datatable() ## filter for project of interest
Extract Project ID
proj_id <-
project_list %>%
filter(projectName == proj_name) %>% ## filter for project of interest
pull(projectId) ## 'pull' out the column 'projectId' which has our project id
proj_id
[1] "6142228471d20135f56a4159"
Find Project Status
GetProjectStatus(project = proj_id) ## finds the status of the project
$autopilotDone
[1] TRUE
$stageDescription
[1] "Ready for modeling"
$stage
[1] "modeling"
Update Owners
Share(object = GetProject(project = proj_id), c("first.m.last.mil@mail.mil"), role = "OWNER")
## shares project with other users
Make Predictions
Upload Prediction Data and Make Predictions
UploadPredictionDataset(project = proj_id, dataSource = testing) ## uploads the testing data for prediction
Extract Upload Data Information
dataset_info <-
ListPredictionDatasets(project = proj_id) %>% as_tibble() ## lists the prediction data information
dataset_info %>% DT::datatable()
Determine Recommended Model
recommended_model <-
GetRecommendedModel(project = GetProject(project = proj_id), ## specify project id
type = RecommendedModelType$RecommendedForDeployment ## specify which model to request
) ## extracts the name of the recommended model
# RecommendedModelType$MostAccurate
# RecommendedModelType$FastAccurate
# RecommendedModelType$RecommendedForDeployment
Get Projections
predict_job_id <-
RequestPredictions(
project = proj_id, ## provide project id
modelId = recommended_model$modelId, ## provide recommended model id
datasetId = dataset_info$id ## specify dataset to run prediction from best model
) ## kicks off predictions in data robot
predictions <-
GetPredictions(project = proj_id, ## specify the project id
predictId = predict_job_id, ## specify the prediction job from previous code
type = "raw" ## "raw" specifies we want the predictions to be probabilities
) ## extracts the predictions from datarobot
predictions %>%
mutate(positiveProbability = round(positiveProbability, 2)) %>% ## round predictions to two decimals
DT::datatable() ## view the predictions
Assess Performance
Join Predictions with Testing Data
metric_data <-
testing %>% ## take the testing data
mutate(prob = predictions$positiveProbability, ## add in column of probabilities from predictions
class = predictions$prediction) %>% ## add in column of class predictions (hof or not)
select(player_id, inducted, prob, class) %>% ## select columns of interest
mutate(inducted = fct_rev(as.factor(inducted))) %>% ## reorder the factors - they defaulted wrong upon download
mutate(class = fct_rev(as.factor(class))) %>% ## reorder the factors - they defaulted wrong upon download
mutate(prob = round(prob, 2)) %>% ## round probabilities to two decimals
left_join(
read_csv("01_data/master.csv") %>% janitor::clean_names() %>% mutate(name = str_c(name_first," ", name_last)) %>% select(player_id, name)
) %>% ## join in name data so we can see which players belong to which player_id
relocate(player_id, name)
DT::datatable(metric_data)
Calculate Performance Metrics
metric_data %>%
yardstick::metrics(truth = inducted, estimate = class) ## calculate performance
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.969
2 kap binary 0.637
metric_data %>%
yardstick::roc_auc(inducted, prob) ## calculate area under the receiver operating characteristic curve
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.968
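To unpack what the accuracy figure above measures, here is a toy base R illustration with invented labels (not the baseball data): accuracy is simply the share of class predictions that match the truth, and a confusion matrix breaks that agreement down by class.

```r
## hypothetical truth and predicted-class vectors, with level 1 ("inducted") first
truth <- factor(c(1, 0, 0, 1, 0, 0, 0, 1), levels = c(1, 0))
class <- factor(c(1, 0, 0, 0, 0, 0, 1, 1), levels = c(1, 0))

table(truth, class)              ## confusion matrix of truth vs. prediction

accuracy <- mean(truth == class) ## proportion of predictions that are correct
accuracy                         ## 0.75 (6 of 8 correct)
```

Note that yardstick treats the first factor level as the "event" by default, which is why the code above used fct_rev to put the inducted class (1) first.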
Get Projections for Future Data
Make Model on All Historical Data
project_name_full <- str_c("baseball_hof_",lubridate::today(),"_exhibition_full_model") ## create project name
StartProject(
dataSource = historical, ## specifies data - can be dataframe, csv, zip file
projectName = project_name_full, ## name of project
wait = TRUE, ## keeps console engaged during datarobot execution
maxWait = 1000, ## how long to wait for completion in seconds
verbosity = 1, ## how much feedback in the console: 1 - lots. 0 - none.
checkInterval = 5, ## how often in seconds it provides an update of progress
target = "inducted", # target column from data
metric = "LogLoss", # loss function to optimize
# rmse
mode = "quick", # datarobot's mode of model search
# "auto", "manual" or "quick"
targetType = "Binary", # target variable type
# "Multiclass" or "Regression"
positiveClass = 1, # if binary, what is "yes" in the data
workerCount = "max" # how many workers execute models in parallel
)
project_list <- ListProjects() %>% as.data.frame() ## returns list of projects
proj_id <-
project_list %>%
filter(projectName == project_name_full) %>% ## filter for project of interest
pull(projectId) ## 'pull' out the column 'projectId' which has our project id
recommended_model <-
GetRecommendedModel(project = GetProject(project = proj_id), ## specify project id
type = RecommendedModelType$RecommendedForDeployment ## specify which model to request
) ## extracts the name of the recommended model
Use Model Built on All Historical Data to Predict on Future Data
UploadPredictionDataset(project = proj_id, dataSource = eligible) ## uploads the testing data for prediction
dataset_info <-
ListPredictionDatasets(project = proj_id) %>% as_tibble() ## lists the prediction data information
predict_job_id <-
RequestPredictions(
project = proj_id, ## provide project id
modelId = recommended_model$modelId, ## provide recommended model id
datasetId = dataset_info$id ## specify dataset to run prediction from best model
) ## kicks off predictions in data robot
predictions <-
GetPredictions(project = proj_id, ## specify the project id
predictId = predict_job_id, ## specify the prediction job from previous code
type = "raw" ## "raw" specifies we want the predictions to be probabilities
) ## extracts the predictions from datarobot
Look at Results
Other Useful Functions (non-exhaustive)
ListModels(project = proj_id)
## this shows all different models available that datarobot created
ListBlueprints(project = proj_id)
## this lists all model types that datarobot can build for the modeling task required by the data
Parting Thoughts
Why would I want to do this?
- Automate a model in production that updates with new data every day
- COVID work example
- DataRobot can handle larger data sets through the API than through the GUI